Quantitative Analysis

Michael Girdwood

…for those who missed last week:

Link to slides on data management

Data Analysis

Data Cleaning

When is my data clean?

  • all variables are correctly named, coded and organised
  • data is organised as rectangles, linked by id variables
  • all data sources have been linked
  • all unnecessary information removed from data sheet
  • data is stored securely in a master file
  • data collection has been completed*

Remember!

Don’t work on your master file

  • make a copy, and only select the variables you need
  • keep raw data raw (no calculations etc)
  • the calculations come during your analysis

First steps - Checks

  • Check that everything ‘looks right’
  • Check your missing values, are they consistent?
  • Check that each participant has the right amount of data - e.g. are all timepoints and variables included
  • Visualise your Data
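
A minimal sketch of how these checks could be scripted. The records, the variable names and the plausible height range are all made up for illustration; they are not from this week's data.

```python
from collections import Counter

# Hypothetical long-format data: one row per participant per timepoint
records = [
    {"id": "P01", "timepoint": 1, "height_cm": 173.1},
    {"id": "P01", "timepoint": 2, "height_cm": 173.4},
    {"id": "P02", "timepoint": 1, "height_cm": None},     # missing value
    {"id": "P03", "timepoint": 1, "height_cm": 181.3},
    {"id": "P03", "timepoint": 2, "height_cm": 1813.0},   # does not 'look right'
]
EXPECTED_TIMEPOINTS = 2  # assumption: every participant is measured twice

# Check missing values
missing = [r for r in records if r["height_cm"] is None]

# Check that each participant has the right amount of data
rows_per_id = Counter(r["id"] for r in records)
incomplete_ids = sorted(
    pid for pid, n in rows_per_id.items() if n != EXPECTED_TIMEPOINTS
)

# Check that values fall in a plausible range (assumed here: 100 to 250 cm)
implausible = [
    r for r in records
    if r["height_cm"] is not None and not 100 <= r["height_cm"] <= 250
]

print(len(missing), incomplete_ids, len(implausible))
```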

Data types



%%{init: {'theme': 'base', 'themeVariables': { 'fontSize': '40px'}}}%%
flowchart TD
A[Data types] --> B[Categorical]
B --> C(Nominal)
B --> D(Ordinal)
A --> E(Numerical)
E --> F(Discrete)
E --> G(Continuous)

Data types

Nominal

  • A categorical variable
  • Often (but not always) binary
  • order of the categories is meaningless

Ordinal

  • An ordered category
  • The order or ranks mean something

Discrete

  • Numerical values, with defined possibilities or limits
  • i.e. a count (can’t have a decimal)
  • often positive integers

Continuous

  • Numerical values that we can use mathematical operations with
  • often numbers with decimals

Data types

Nominal

  • group: [study group, control group]
  • sex: [male, female]

Ordinal

  • How strongly do you agree… [strongly disagree, disagree, agree, strongly agree]
  • severity of disease: [grade 1, grade 2]

Discrete

  • counts [1,2,3,4...]
  • number of visits

Continuous

  • height [173.1, 181.3, 193.0 ...]
  • weight
  • age

Data Analysis - Definitions

Describing Numbers

When we report on data, we want to summarise it without having to provide the entire dataset.

Common things to report:

  • Typical, centre or most common values
  • The spread or how variable some data are
  • The limits or extremes of the data
  • The distribution (this tries to encompass all of the above)

Typical, centre or most common value:

i.e. measures of central tendency

Mean

  • “average”

  • sum of all values / the number of values

    [1, 6, 2, 7, 4, 7, 3]

(1+6+2+7+4+7+3) / 7 = 4.286

Typical, centre or most common value:

Median

  • “middle number”

    [1, 2, 3, 4, 6, 7, 7] = 4


    [4, 5, 7 | 8, 10, 12] = 7.5

Typical, centre or most common value:

Mode

  • “most common number”

    [1, 6, 2, 7, 4, 7, 3]

    = 7

Central tendency



age = [18, 19, 20, 18, 21, 22, 23, 65]

mean(age) = 25.75

median(age) = 20.5

mode(age) = 18

  • Mean can be influenced by large outliers and extremes
  • Mode most useful with count data
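
The same three numbers can be reproduced with Python's built-in statistics module. This is only a sketch; the last line drops the 65-year-old to show how much a single extreme value moves the mean.

```python
import statistics

age = [18, 19, 20, 18, 21, 22, 23, 65]

mean_age = statistics.mean(age)      # sum of values / number of values
median_age = statistics.median(age)  # average of the two middle values here
mode_age = statistics.mode(age)      # most common value

print(mean_age, median_age, mode_age)  # 25.75 20.5 18

# Mean of the same data without the outlier (65)
mean_without_outlier = statistics.mean(age[:-1])
print(round(mean_without_outlier, 2))
```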

Measures of spread

Quartiles

  • “quarters” i.e. splitting your data up into quarters.
  • Q1 = where the first quarter ends
  • Q3 = where the third quarter ends
  • Q1 to Q3 = the interquartile range = the middle half of the data


[14,15,16,17, 18,19,20,21, 22,25,27,28, 30,32,34,40,55]
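
A sketch of the quartiles for the list above, using Python's statistics module. The "inclusive" method is one of several quartile conventions, so other software may give slightly different values.

```python
import statistics

data = [14, 15, 16, 17, 18, 19, 20, 21, 22, 25, 27, 28, 30, 32, 34, 40, 55]

# n=4 splits the data into quarters and returns the three cut points
q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1  # width of the middle half of the data

print(q1, q2, q3, iqr)  # 18.0 22.0 30.0 12.0
```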

Measures of spread

Standard deviation

  • A measure of variation around the mean

[2,5,3,4,7,1,8,10]

mean = 5

Measures of spread

Standard deviation

Value
2
5
3
4
7
1
8
10

Measures of spread

Standard deviation

Value Deviation
2 -3
5 0
3 -2
4 -1
7 2
1 -4
8 3
10 5
Total 0

Measures of spread

Standard deviation

Value Deviation Deviation^2
2 -3 9
5 0 0
3 -2 4
4 -1 1
7 2 4
1 -4 16
8 3 9
10 5 25
Total 0 68
  • Variance = sum(Deviation^2) / (n - 1)

  • Variance = 68 / 7

  • Variance = 9.714

  • SD = squareroot(variance)

  • SD = 3.117


Summary:

mean 5 (SD 3.117)
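
The steps in the table can be followed in a few lines of Python. This is a sketch of the sample standard deviation (dividing by n - 1); the final line checks it against the built-in statistics.stdev.

```python
import math
import statistics

values = [2, 5, 3, 4, 7, 1, 8, 10]

mean = sum(values) / len(values)
deviations = [v - mean for v in values]       # value minus the mean
squared = [d ** 2 for d in deviations]        # squared deviations
variance = sum(squared) / (len(values) - 1)   # divide by n - 1
sd = math.sqrt(variance)

print(mean, sum(squared), round(variance, 3), round(sd, 3))  # 5.0 68.0 9.714 3.117

# Same answer as the built-in sample standard deviation
assert math.isclose(sd, statistics.stdev(values))
```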

Measures of the extremes


Minimum and maximum

  • the highest and lowest value

Range

  • minimum to the maximum

Both can be influenced by outliers and extreme values

Calculation time!


Measure Excel function
Mean AVERAGE()
Median MEDIAN()
Mode MODE()
Q1 QUARTILE(…, 1)
Q3 QUARTILE(…, 3)
Standard Deviation STDEV()
Minimum MIN()
Maximum MAX()

Your turn

Link to this week's data

Standard error of the mean

  • how far away the sample mean is likely to be from the true population mean

  • Standard deviation divided by the square root of the sample size

\[SE = \frac{SD}{\sqrt{n}}\]
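
As a sketch, here is the formula applied to the eight values from the standard deviation example:

```python
import math
import statistics

values = [2, 5, 3, 4, 7, 1, 8, 10]

sd = statistics.stdev(values)      # sample standard deviation
se = sd / math.sqrt(len(values))   # SE = SD / sqrt(n)

print(round(se, 3))  # 1.102
```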

Confidence Intervals

“If we repeated the same measure again in future, 95% of the time the 95% confidence interval will cover the true population parameter”

Confidence Intervals

“If we repeated the same calculation again in future on different samples, 95% of the time the 95% confidence interval will cover the true population parameter”

Common misunderstandings of CIs:

  • does not mean 95% of the data lie within the interval
  • does not mean that there is a 95% chance that the true value lies within this particular interval
  • does not mean that there is a 95% chance that a future estimate will lie within the current interval
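
The "repeated sampling" idea can be illustrated with a small simulation. This is a sketch under stated assumptions: the population (mean 100, SD 15), the sample size and the number of repeats are invented, and the interval uses a normal approximation (mean ± 1.96 × SE), which is slightly too narrow for small samples.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

TRUE_MEAN, TRUE_SD = 100.0, 15.0   # hypothetical population
N, REPEATS = 50, 2000              # sample size, number of repeated studies
Z = statistics.NormalDist().inv_cdf(0.975)  # about 1.96

covered = 0
for _ in range(REPEATS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(N)
    # Does this sample's 95% CI cover the true population mean?
    if m - Z * se <= TRUE_MEAN <= m + Z * se:
        covered += 1

coverage = covered / REPEATS
print(f"{coverage:.1%} of intervals covered the true mean")
```

Each single interval either covers the true mean or it does not; the 95% describes how often the procedure succeeds across many repeats.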

Statistical Software

Jamovi

https://www.jamovi.org/

Free and open source statistical software

Designed to be easy to use, and with an interface similar to other statistical programs

User Guide for Jamovi

Jamovi - your turn

  • import the same data sheet
  • calculate some of the descriptive statistics from before

Data Visualisation

Data Visualisation

  • Communicating graphic representations of data
  • understand your data by “seeing” it

Uses:

  • data checking
  • investigating outliers
  • presenting results
  • communicating and engaging with stakeholders

Types of plots

Financial Times - Visual Vocabulary

Plot Advice

A plot should be easy to understand in just a few seconds…

Plot Advice


Some simple advice from Murrell, 2013:

  • Display data values using position or length
  • Use horizontal lengths in preference to vertical lengths
  • Watch your data-ink ratio
  • Think very carefully before using colour to represent data values
  • Do not use areas to represent values
  • Please do not use angles or slopes to represent data values
  • Please, please do not use volumes to represent values

Constructing a box plot

cyl mpg
Mazda RX4 6 21.0
Mazda RX4 Wag 6 21.0
Datsun 710 4 22.8
Hornet 4 Drive 6 21.4
Hornet Sportabout 8 18.7
Valiant 6 18.1
Duster 360 8 14.3
Merc 240D 4 24.4
Merc 230 4 22.8
Merc 280 6 19.2
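
A sketch of the numbers a box plot is usually built from, using the mpg column above. Two assumptions here are not from the slides: quartiles use the "inclusive" method, and the whiskers follow the common 1.5 × IQR rule. Plotting software may use other conventions.

```python
import statistics

mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2]

# The box: Q1, median and Q3
q1, median, q3 = statistics.quantiles(mpg, n=4, method="inclusive")
iqr = q3 - q1

# The whiskers: most extreme values within 1.5 * IQR of the box
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
inside = [x for x in mpg if lower_fence <= x <= upper_fence]
whisker_low, whisker_high = min(inside), max(inside)

# Anything beyond the fences is drawn as an individual point
outliers = sorted(x for x in mpg if x < lower_fence or x > upper_fence)

print(round(q1, 3), median, round(q3, 3), whisker_low, whisker_high, outliers)
```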

Your turn!

A teaser on some analyses…

Disclaimer:

  • Statistics is a very complicated topic, with increasingly complex methods to analyse data
  • Often there is no single best approach to analysis
  • It is critical that you understand on a basic level what a statistical test does
  • Because of this, it is unfeasible for me to go into detail about how to run all of these tests…

Disclaimer:

My suggestion:

  • Look at taking a beginner statistics course - university / book / online - to get a basic overview of everything
  • When you get to your analysis in your project - work together with your team to learn a common approach to analysing your particular research question

Statistical Inference

  • The art and science of using data from a sample to make inferences on a population, based on certain characteristics from the data

  • A common approach to this is called frequentist inference - which frames an analysis in terms of how likely (or unlikely) a result is to occur.

    • often compared to a ‘null hypothesis’ - e.g. zero effect, no difference

Statistical Inference

  • Tests are sometimes referred to as ‘statistically significant’
    • this means that, based on the calculation, what you found is considered to be different from the null, and is unlikely to be due to random chance alone.
  • statistically significant ≠ clinically significant

Statistical definitions


Dependent variable    Independent variable
Outcome variable      Explanatory variable
Response variable     Predictor variable
y-variable            x-variable


Blood pressure ← Diabetes Status

Jumping height ~ Gender

Differences between two groups - categorical data

Difference in proportions

Diabetes Status
gender No Yes Total
female 18 8 26
male 14 10 24
Total 32 18 50


  • Chi-square test (χ²)
  • McNemar’s test
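
A sketch of the chi-square calculation for the table above, done by hand so each step is visible. No continuity correction is applied, so statistical software that applies one by default will report a slightly different value.

```python
import math

# Observed counts from the table: columns are diabetes No, Yes
observed = [[18, 8],    # female
            [14, 10]]   # male

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        # Count expected in this cell if sex and diabetes were unrelated
        expected = row_totals[i] * col_totals[j] / n
        chi2 += (obs - expected) ** 2 / expected

# A 2x2 table has 1 degree of freedom, where the p-value has a closed form
p_value = math.erfc(math.sqrt(chi2 / 2))

print(round(chi2, 3), round(p_value, 3))
```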

Differences between two groups - continuous data

Difference in means


Diabetes    Cholesterol, Mean (SD)
No          176 (15.4)
Yes         212 (17.4)


  • Student's t-test
  • Wilcoxon-Mann-Whitney U test
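
A sketch of how a Student's t statistic is built from summary statistics. The means and SDs come from the cholesterol table above, but the slides do not give group sizes, so n = 25 per group is an invented value for illustration only. The helper name pooled_t is also hypothetical.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    """Student's t statistic and degrees of freedom (equal variances assumed)."""
    df = n1 + n2 - 2
    # Pooled variance: a weighted average of the two group variances
    pooled_var = ((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df
    # Standard error of the difference between the two means
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (mean1 - mean2) / se_diff, df

# n = 25 per group is an ASSUMPTION, not a value from the slides
t_stat, df = pooled_t(212, 17.4, 25, 176, 15.4, 25)
print(round(t_stat, 2), df)
```

The t statistic is the difference in means expressed in standard errors; statistical software converts it to a p-value using the t distribution.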

Differences between many groups

Difference in many means

  • Analysis of variance (ANOVA)

Relationships between variables

Correlation

A crude measure of how related two variables are
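
A sketch of Pearson's correlation coefficient, one common correlation measure, using the cyl and mpg columns from the box plot table earlier in these slides. The helper name pearson_r is made up for this example.

```python
import math

cyl = [6, 6, 4, 6, 8, 6, 8, 4, 4, 6]
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2]

def pearson_r(x, y):
    """Pearson correlation: ranges from -1 to 1, 0 means no linear relationship."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(cyl, mpg)
print(round(r, 2))  # -0.86
```

The negative sign means that, in these ten cars, more cylinders go with fewer miles per gallon.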

Relationships between variables

Correlation - be careful!

Relationships between variables

Regression Modelling

  • A more complex approach to looking at relationships between variables

  • Model the effect of different explanatory variables on an outcome variable

  • Allows for

    • different types of relationships
    • controlling for different variables (i.e. more than one variable at a time)
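
The simplest regression model is a straight line with one explanatory variable. As a sketch, here is a least-squares fit of mpg (outcome) on cyl (explanatory), using the ten cars from the box plot table earlier in these slides. Real analyses would use statistical software and check the model's assumptions.

```python
cyl = [6, 6, 4, 6, 8, 6, 8, 4, 4, 6]                                # x-variable
mpg = [21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2]  # y-variable

mean_x = sum(cyl) / len(cyl)
mean_y = sum(mpg) / len(mpg)

# Least-squares estimates for the line: mpg = intercept + slope * cyl
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(cyl, mpg))
sxx = sum((x - mean_x) ** 2 for x in cyl)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(round(slope, 2), round(intercept, 2))

# Predicted mpg for a 6-cylinder car under this model
predicted_6 = intercept + slope * 6
```

The slope is the estimated change in mpg for each additional cylinder.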

Resources

For more complicated data work, or if you are considering a research career, consider learning a scripting language such as R or Python.

Thank you

Any questions?